Misspelling Oblivious Word Embeddings
In this paper we present a method to learn word embeddings that are resilient
to misspellings. Existing word embeddings have limited applicability to
malformed texts, which contain a non-negligible amount of out-of-vocabulary
words. We propose a method combining FastText with subwords and a supervised
task of learning misspelling patterns. In our method, misspellings of each word
are embedded close to their correct variants. We train these embeddings on a
new dataset we are releasing publicly. Finally, we experimentally show the
advantages of this approach on both intrinsic and extrinsic NLP tasks using
public test sets.
Comment: 9 pages
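As a rough illustration of why subword modelling helps here: the toy sketch below embeds words as the mean of hashed character n-gram vectors, so a misspelling shares most n-grams with its correct form and the two vectors land close together even before any supervised misspelling objective. The hashing scheme, dimensions, and example words are illustrative assumptions, not the paper's actual model.

    # Toy sketch: fastText-style subword embeddings place misspellings near their
    # correct variants because the two words share most character n-grams.
    # Hash function, dimensions, and bucket count are illustrative, not the paper's.
    import numpy as np

    DIM, BUCKETS = 50, 2 ** 16
    table = np.random.default_rng(0).normal(size=(BUCKETS, DIM))  # shared subword vectors

    def ngrams(word, n=3):
        padded = f"<{word}>"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def embed(word):
        return table[[hash(g) % BUCKETS for g in ngrams(word)]].mean(axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # "misspeling" shares most trigrams with "misspelling", so the embeddings are close.
    print(cosine(embed("misspelling"), embed("misspeling")))   # high similarity
    print(cosine(embed("misspelling"), embed("unrelated")))    # much lower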
PAQ: 65 Million Probably-Asked Questions and What You Can Do With Them
Open-domain Question Answering models which directly leverage question-answer
(QA) pairs, such as closed-book QA (CBQA) models and QA-pair retrievers, show
promise in terms of speed and memory compared to conventional models which
retrieve and read from text corpora. QA-pair retrievers also offer
interpretable answers, a high degree of control, and are trivial to update at
test time with new knowledge. However, these models lack the accuracy of
retrieve-and-read systems, as substantially less knowledge is covered by the
available QA-pairs relative to text corpora like Wikipedia. To facilitate
improved QA-pair models, we introduce Probably Asked Questions (PAQ), a very
large resource of 65M automatically-generated QA-pairs. We introduce a new
QA-pair retriever, RePAQ, to complement PAQ. We find that PAQ preempts and
caches test questions, enabling RePAQ to match the accuracy of recent
retrieve-and-read models, whilst being significantly faster. Using PAQ, we
train CBQA models which outperform comparable baselines by 5%, but trail RePAQ
by over 15%, indicating the effectiveness of explicit retrieval. RePAQ can be
configured for size (under 500MB) or speed (over 1K questions per second)
whilst retaining high accuracy. Lastly, we demonstrate RePAQ's strength at
selective QA, abstaining from answering when it is likely to be incorrect. This
enables RePAQ to "back off" to a more expensive state-of-the-art model, leading
to a combined system which is both more accurate and 2x faster than the
state-of-the-art model alone.
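A minimal sketch of the QA-pair retrieval idea with selective back-off follows; the encoder, cached QA pairs, confidence threshold, and fallback function are toy stand-ins, not RePAQ's actual components.

    # Toy sketch of QA-pair retrieval with selective back-off. The encoder,
    # cached QA pairs, and confidence threshold are stand-ins, not RePAQ's.
    import numpy as np

    qa_pairs = [("who wrote hamlet", "William Shakespeare"),
                ("what is the capital of france", "Paris")]

    rng, vocab = np.random.default_rng(0), {}

    def embed(text):
        # Bag-of-words over random word vectors, just to make the example run.
        vecs = [vocab.setdefault(w, rng.normal(size=64)) for w in text.lower().split()]
        v = np.mean(vecs, axis=0)
        return v / np.linalg.norm(v)

    index = np.stack([embed(q) for q, _ in qa_pairs])

    def expensive_reader(question):
        return "<fall back to a slower retrieve-and-read model>"

    def answer(question, threshold=0.8):
        scores = index @ embed(question)
        best = int(scores.argmax())
        if scores[best] >= threshold:
            return qa_pairs[best][1]       # cheap cached answer from the QA index
        return expensive_reader(question)  # abstain and back off

    print(answer("who wrote hamlet"))
    print(answer("who painted the mona lisa"))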
How Decoding Strategies Affect the Verifiability of Generated Text
Recent progress in pre-trained language models led to systems that are able
to generate text of an increasingly high quality. While several works have
investigated the fluency and grammatical correctness of such models, it is
still unclear to which extent the generated text is consistent with factual
world knowledge. Here, we go beyond fluency and also investigate the
verifiability of text generated by state-of-the-art pre-trained language
models. A generated sentence is verifiable if it can be corroborated or
disproved by Wikipedia, and we find that the verifiability of generated text
strongly depends on the decoding strategy. In particular, we discover a
tradeoff between factuality (i.e., the ability of generating Wikipedia
corroborated text) and repetitiveness. While decoding strategies such as top-k
and nucleus sampling lead to less repetitive generations, they also produce
less verifiable text. Based on these findings, we introduce a simple and
effective decoding strategy which, in comparison to previously used decoding
strategies, produces less repetitive and more verifiable text.
Comment: accepted at Findings of EMNLP 2020
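For reference, a minimal sketch of the two sampling strategies discussed above, top-k and nucleus (top-p) sampling, applied to a vector of next-token logits; the cutoff values and the random logits are illustrative.

    # Minimal sketch of top-k and nucleus (top-p) sampling over next-token logits.
    import numpy as np

    rng = np.random.default_rng(0)

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def top_k_sample(logits, k=50):
        # Keep only the k most likely tokens, renormalize, then sample.
        probs = softmax(logits)
        keep = np.argsort(probs)[-k:]
        return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

    def nucleus_sample(logits, p=0.9):
        # Keep the smallest set of tokens whose cumulative probability exceeds p.
        probs = softmax(logits)
        order = np.argsort(probs)[::-1]
        cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
        keep = order[:cutoff]
        return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))

    logits = rng.normal(size=1000)  # stand-in for a language model's output
    print(top_k_sample(logits), nucleus_sample(logits))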
GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration
Noticing the urgent need for tools enabling fast and user-friendly qualitative
analysis of the large-scale textual corpora used in modern NLP, we propose to
turn to mature and well-tested methods from the domain of Information Retrieval
(IR) - a research field with a long history of tackling TB-scale document
collections. We discuss how Pyserini - a widely used toolkit for reproducible
IR research - can be integrated with the Hugging Face ecosystem
of open-source AI libraries and artifacts. We leverage the existing
functionalities of both platforms while proposing novel features further
facilitating their integration. Our goal is to give NLP researchers tools that
will allow them to develop retrieval-based instrumentation for their data
analytics needs with ease and agility. We include a Jupyter Notebook-based walk
through the core interoperability features, available on GitHub at
https://github.com/huggingface/gaia. We then demonstrate how the ideas we
present can be operationalized to create a powerful tool for qualitative data
analysis in NLP. We present GAIA Search - a search engine built following
previously laid out principles, giving access to four popular large-scale text
collections. GAIA serves a dual purpose: it illustrates the potential of the
methodologies we discuss, and it works as a standalone qualitative analysis
tool that NLP researchers can leverage to understand datasets prior to using
them in training. GAIA is hosted live on Hugging Face Spaces -
https://huggingface.co/spaces/spacerini/gaia
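The sketch below shows the generic Hugging Face + Pyserini pattern the abstract describes: export a datasets corpus to Pyserini's JSON format, build a BM25 index with Pyserini's standard indexing CLI, and query it. The dataset name, paths, and query are placeholders; GAIA/Spacerini's own helpers may expose this differently.

    # Generic pattern: index a Hugging Face dataset with Pyserini and search it.
    # Dataset name, paths, and query are placeholders, not GAIA's own interface.
    import json, os
    from datasets import load_dataset

    os.makedirs("corpus", exist_ok=True)
    ds = load_dataset("imdb", split="train[:1000]")  # any text dataset works here
    with open("corpus/docs.jsonl", "w") as f:
        for i, row in enumerate(ds):
            f.write(json.dumps({"id": str(i), "contents": row["text"]}) + "\n")

    # Build a BM25 index with Pyserini's indexing CLI (run in a shell):
    #   python -m pyserini.index.lucene --collection JsonCollection \
    #     --input corpus --index indexes/demo \
    #     --generator DefaultLuceneDocumentGenerator --threads 4 --storeRaw

    from pyserini.search.lucene import LuceneSearcher
    searcher = LuceneSearcher("indexes/demo")
    for hit in searcher.search("a film about friendship", k=5):
        print(hit.docid, round(hit.score, 2))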
Scaling Data-Constrained Language Models
The current trend of scaling language models involves increasing both
parameter count and training dataset size. Extrapolating this trend suggests
that training dataset size may soon be limited by the amount of text data
available on the internet. Motivated by this limit, we investigate scaling
language models in data-constrained regimes. Specifically, we run a large set
of experiments varying the extent of data repetition and compute budget,
ranging up to 900 billion training tokens and 9 billion parameter models. We
find that with constrained data for a fixed compute budget, training with up to
4 epochs of repeated data yields negligible changes to loss compared to having
unique data. However, with more repetition, the value of adding compute
eventually decays to zero. We propose and empirically validate a scaling law
for compute optimality that accounts for the decreasing value of repeated
tokens and excess parameters. Finally, we experiment with approaches mitigating
data scarcity, including augmenting the training dataset with code data or
removing commonly used filters. Models and datasets from our 400 training runs
are freely available at https://github.com/huggingface/datablations.
Comment: 47 pages (9 main), 37 figures, 13 tables
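To make the "decreasing value of repeated tokens" concrete, the sketch below counts repeated epochs as effective unique data with a saturating-exponential discount; the functional form mirrors the paper's parameterization, but the constant used here is made up rather than the fitted value.

    # Illustration only: repeated epochs contribute "effective" unique tokens with
    # exponentially diminishing returns. R_STAR is a made-up constant, not the fit.
    import math

    R_STAR = 15.0

    def effective_tokens(unique_tokens, epochs):
        repetitions = epochs - 1
        return unique_tokens * (1 + R_STAR * (1 - math.exp(-repetitions / R_STAR)))

    unique = 100e9  # pretend 100B unique tokens are available
    for epochs in (1, 4, 16, 64):
        seen = unique * epochs
        eff = effective_tokens(unique, epochs)
        print(f"{epochs:>2} epochs: {seen / 1e9:6.0f}B tokens seen, ~{eff / 1e9:5.0f}B effective")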
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Large pre-trained language models have been shown to store factual knowledge
in their parameters, and achieve state-of-the-art results when fine-tuned on
downstream NLP tasks. However, their ability to access and precisely manipulate
knowledge is still limited, and hence on knowledge-intensive tasks, their
performance lags behind task-specific architectures. Additionally, providing
provenance for their decisions and updating their world knowledge remain open
research problems. Pre-trained models with a differentiable access mechanism to
explicit non-parametric memory can overcome this issue, but have so far been
only investigated for extractive downstream tasks. We explore a general-purpose
fine-tuning recipe for retrieval-augmented generation (RAG) -- models which
combine pre-trained parametric and non-parametric memory for language
generation. We introduce RAG models where the parametric memory is a
pre-trained seq2seq model and the non-parametric memory is a dense vector index
of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG
formulations, one which conditions on the same retrieved passages across the
whole generated sequence, and another which can use different passages per token. We
fine-tune and evaluate our models on a wide range of knowledge-intensive NLP
tasks and set the state-of-the-art on three open domain QA tasks, outperforming
parametric seq2seq models and task-specific retrieve-and-extract architectures.
For language generation tasks, we find that RAG models generate more specific,
diverse and factual language than a state-of-the-art parametric-only seq2seq
baseline.
Comment: Accepted at NeurIPS 2020
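A toy computation contrasting the two formulations mentioned above: RAG-Sequence marginalizes over retrieved passages once per output sequence, while RAG-Token marginalizes at every token position. The probabilities are made-up numbers standing in for the retriever's p(z|x) and the generator's per-token probabilities.

    # Toy contrast of the two RAG formulations; all probabilities are made up.
    import numpy as np

    p_doc = np.array([0.6, 0.4])            # retriever's p(z|x) for two passages
    p_tok = np.array([[0.9, 0.2, 0.8],      # generator's p(y_i | x, z=0, y_<i)
                      [0.1, 0.7, 0.6]])     # generator's p(y_i | x, z=1, y_<i)

    # RAG-Sequence: the same passage conditions the whole sequence, then marginalize.
    p_sequence = float((p_doc * p_tok.prod(axis=1)).sum())

    # RAG-Token: marginalize over passages separately at every token position.
    p_token = float((p_doc[:, None] * p_tok).sum(axis=0).prod())

    print(p_sequence, p_token)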
KILT: a Benchmark for Knowledge Intensive Language Tasks
Challenging problems such as open-domain question answering, fact checking,
slot filling and entity linking require access to large, external knowledge
sources. While some models do well on individual tasks, developing general
models is difficult as each task might require computationally expensive
indexing of custom knowledge sources, in addition to dedicated infrastructure.
To catalyze research on models that condition on specific information in large
textual resources, we present a benchmark for knowledge-intensive language
tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia,
reducing engineering turnaround through the re-use of components, as well as
accelerating research into task-agnostic memory architectures. We test both
task-specific and general baselines, evaluating downstream performance in
addition to the ability of the models to provide provenance. We find that a
shared dense vector index coupled with a seq2seq model is a strong baseline,
outperforming more tailor-made approaches for fact checking, open-domain
question answering and dialogue, and yielding competitive results on entity
linking and slot filling, by generating disambiguated text. KILT data and code
are available at https://github.com/facebookresearch/KILT.
Comment: accepted at NAACL 2021
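A small sketch of the dual evaluation the benchmark emphasizes, scoring both the downstream answer and whether the predicted provenance overlaps the gold Wikipedia pages; the record layout is a simplified stand-in, not KILT's exact schema or metrics.

    # Simplified sketch of scoring answers plus provenance; not KILT's exact schema.
    def evaluate(predictions, gold):
        correct_answer, correct_provenance = 0, 0
        for pred, ref in zip(predictions, gold):
            if pred["answer"].strip().lower() == ref["answer"].strip().lower():
                correct_answer += 1
            if set(pred["provenance"]) & set(ref["provenance"]):
                correct_provenance += 1
        n = len(gold)
        return correct_answer / n, correct_provenance / n

    gold = [{"answer": "Paris", "provenance": {"wiki:France"}}]
    preds = [{"answer": "paris", "provenance": {"wiki:Paris", "wiki:France"}}]
    print(evaluate(preds, gold))  # (1.0, 1.0)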
FinGPT: Large Generative Models for a Small Language
Large language models (LLMs) excel in many tasks in NLP and beyond, but most
open models have very limited coverage of smaller languages and LLM work tends
to focus on languages where nearly unlimited data is available for pretraining.
In this work, we study the challenges of creating LLMs for Finnish, a language
spoken by less than 0.1% of the world population. We compile an extensive
dataset of Finnish combining web crawls, news, social media and eBooks. We
pursue two approaches to pretrain models: 1) we train seven monolingual models
from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the
pretraining of the multilingual BLOOM model on a mix of its original training
data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
For model evaluation, we introduce FIN-bench, a version of BIG-bench with
Finnish tasks. We also assess other model qualities such as toxicity and bias.
Our models and tools are openly available at https://turkunlp.org/gpt3-finnish.
Comment: 17 pages (10 main), 7 figures, 5 tables
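A rough sketch, using standard Hugging Face APIs, of the continued-pretraining recipe behind BLUUMI: interleave the original pretraining mix with Finnish text and keep training a causal LM. The model checkpoint, data files, mixing ratio, and training arguments are placeholders, not the paper's setup.

    # Sketch of continued pretraining on a data mixture; names and ratios are placeholders.
    from datasets import load_dataset, interleave_datasets
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")   # small stand-in checkpoint
    model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

    finnish = load_dataset("text", data_files="finnish_corpus.txt", split="train")
    original = load_dataset("text", data_files="original_mix.txt", split="train")
    mixed = interleave_datasets([original, finnish], probabilities=[0.5, 0.5], seed=0)
    mixed = mixed.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                      batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="bluumi-style-run", per_device_train_batch_size=1),
        train_dataset=mixed,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()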